Within months, COVID-19 went from an epidemic to a pandemic, affecting people's lives all over the world. From the first identified case in December 2019, how did the virus spread so far and so fast?
Out of curiosity, I explored the open dataset from the World Health Organization website and applied the analysis skills I have learned to this ongoing real-world problem.
The main dataset was acquired from the World Health Organization website, with dates ranging from January 11, 2020, to June 20, 2020.
In the first part, I mainly used Python with the NumPy and Pandas packages for data wrangling, including cleaning and merging the datasets. For exploratory data analysis, I used visualization tools including Matplotlib and seaborn. I also taught myself Plotly Express, a high-level interface for creating interactive and informative visualizations.
Second, I integrated the Jupyter notebook with MySQL Workbench through an API to migrate data from a Pandas DataFrame to a SQL table.
Last but not least, I connected the MySQL server to Tableau, built a dashboard, and deployed it to a public website to offer interactive data visualization.
# Import the relevant libraries
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Suppress all warning
import warnings
warnings.simplefilter(action='ignore')
# Expand the output display to see more columns of DataFrame
pd.set_option('display.max_columns', 50)
# print all the outputs in a cell
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# Load the data
df_WHO = pd.read_csv('WHO_COVID19_globaldata.csv')
# Eyeball the data
df_WHO.head()
# There are 23022 observations & 8 columns in this DataFrame
df_WHO.shape
df_WHO.info()
# Convert all Date to datetime format
df_WHO['Date_reported']=pd.to_datetime(df_WHO['Date_reported'])
# Check the column types again; we can see Date_reported is recorded in the UTC timezone
df_WHO.info()
# Dates range from January 11, 2020, to June 20, 2020.
df_WHO.Date_reported.min(), df_WHO.Date_reported.max()
df_WHO.describe(include ='all')
# Taiwan is excluded from the World Health Assembly
df_WHO[df_WHO['Country'].str.startswith('Taiwan')]
df_WHO.isna().any()
df_WHO.isna().sum()
# The missing Country_code values all come from Namibia
df_WHO[df_WHO.Country_code.isna() == True]['Country'].nunique()
df_WHO[df_WHO.Country_code.isna() == True].head(3)
# Missing values in the WHO_region column all come from undefined countries; I will remove those rows later
df_WHO[df_WHO.WHO_region.isna() == True].head(3)
The World Health Organization (WHO) divides the world into six WHO regions, for the purposes of reporting, analysis and administration.

# Namibia's ISO code is the literal string 'NA', which pandas parsed as a missing value; restore it
df_WHO.Country_code.fillna('NA',inplace=True)
df_WHO.isna().sum()
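The root cause here is that `pd.read_csv` treats the literal string 'NA' as a missing-value sentinel by default, so Namibia's ISO code is read as NaN. Instead of patching afterwards, the column could be protected at load time with `keep_default_na=False`. A minimal sketch on a simulated CSV slice (the column layout is assumed):

```python
import io
import pandas as pd

# Simulate a slice of the WHO CSV: Namibia's country code is the literal string 'NA'
csv_text = "Date_reported,Country_code,Country\n2020-06-20,NA,Namibia\n"

# Default behaviour: pandas treats 'NA' as a missing value
default_df = pd.read_csv(io.StringIO(csv_text))
print(default_df['Country_code'].isna().iloc[0])   # True

# keep_default_na=False keeps 'NA' as a real string;
# na_values restores the sentinels we still want treated as missing
safe_df = pd.read_csv(io.StringIO(csv_text),
                      keep_default_na=False,
                      na_values=[''])
print(safe_df['Country_code'].iloc[0])
```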
# Drop the rows with WHO_region = NaN
df_WHO.dropna(how='any',subset=['WHO_region'],inplace=True)
# Check: the first-stage data cleaning is now complete
df_WHO.isna().sum()
# Load the data (This dataset includes Taiwan)
df = pd.read_csv('covid-data.csv')
df.head()
# Extract Taiwan observations only
df_TW = df[df['location'] == 'Taiwan']
df_TW.head()
# There are many unnecessary columns; I will remove them later
df_TW.info()
# Only fetch the fields common with df_WHO DataFrame
df_TW = df_TW[['date','iso_code','location','new_cases',\
'total_cases','new_deaths','total_deaths']]
df_TW.head()
# Taiwan belongs to the WPRO region, so add a WHO_region column to df_TW
df_TW.insert(3,'WHO_region','WPRO')
df_TW.head(5)
# Make the date column of type datetime(UTC timezone)
df_TW['date']=pd.to_datetime(df_TW['date']).dt.tz_localize('UTC')
df_TW.info()
df_TW['date'].min(), df_TW['date'].max()
df_TW.head()
df_WHO.columns.tolist()
# Rename the df_TW columns to be consistent with those of df_WHO
df_TW.columns = df_WHO.columns.tolist()
df_TW.head()
# Remove the date with 0 Cumulative_cases
df_TW = df_TW[df_TW['Cumulative_cases'] != 0]
# Reset index starting from 0
df_TW.index = range(len(df_TW))
df_TW.head()
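Assigning a fresh `range` as the index works; pandas also has `reset_index(drop=True)`, which expresses the same intent directly. A tiny sketch on toy data:

```python
import pandas as pd

demo = pd.DataFrame({'x': [10, 20, 30]}, index=[7, 8, 9])

# Drop the old index entirely instead of keeping it as a new column
demo = demo.reset_index(drop=True)
print(demo.index.tolist())   # [0, 1, 2]
```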
df_TW.isna().sum()
# Missing values are all from the same record
df_TW[df_TW['New_cases'].isna()]
# Now I can extract specific records based on the internal index
df_TW.iloc[45:50,]
# Fill the gap with the values from the previous record (forward fill) to make it complete
df_TW.fillna(method='ffill',inplace = True)
# Check it again
df_TW.iloc[45:50,]
# Now the second-stage data cleaning is complete
df_TW.isna().sum()
df_World = pd.concat([df_WHO, df_TW], axis = 0)
df_World.head()
df_World.tail()
# Reset index starting from 0
df_World.index = range(len(df_World))
df_World.tail()
df_World.shape
df_World.info()
df_World.describe(include ='all')
df_World.Date_reported.min()
df_World[(df_World.Date_reported == df_World.Date_reported.min())]
df_World.Date_reported.nunique()
df_World[(df_World.Date_reported < '2020-02-01') & (df_World.Cumulative_cases != 0)]\
['Country'].nunique()
df_World['Country'].nunique()
df_death = df_World[df_World.Cumulative_deaths != 0]
df_death.head()
Confirmed_death = df_death.groupby(by = 'Date_reported')['Country'].nunique()
Confirmed_death
Confirmed_death[Confirmed_death == 10].index[0]
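The lookup above finds the first date on which exactly 10 countries had reported deaths. A slightly more robust variant asks for "at least 10", which still works if the count jumps past 10 in a single day; `idxmax` on a boolean Series returns the first index where the condition holds. A sketch with hypothetical counts:

```python
import pandas as pd

# Hypothetical daily counts of countries with at least one confirmed death
counts = pd.Series([4, 8, 11, 15],
                   index=pd.to_datetime(['2020-01-20', '2020-01-21',
                                         '2020-01-22', '2020-01-23']))

# idxmax on a boolean Series returns the first index where the condition is True
first_date = (counts >= 10).idxmax()
print(first_date)
```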
df_region = df_World.groupby('WHO_region')[['New_cases','New_deaths']].sum().reset_index().\
rename(columns={'New_cases':'Cumulative_cases','New_deaths':'Cumulative_deaths'})
df_region.head()
The Case Fatality Rate (CFR) is the ratio between confirmed deaths and confirmed cases.
df_region['CFR']= round((df_region['Cumulative_deaths']/df_region['Cumulative_cases']),3)
df_region.sort_values('CFR', ascending = False)
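The CFR division is safe here because every region has confirmed cases, but at finer granularity a row with zero cumulative cases would produce a division artifact. A hedged sketch of a guarded version, on toy numbers:

```python
import numpy as np
import pandas as pd

# Toy data: the second row has zero cases, which would make the ratio undefined
demo = pd.DataFrame({'Cumulative_cases': [2000, 0],
                     'Cumulative_deaths': [100, 0]})

# Guard the ratio: rows with zero cases get NaN instead of a division artifact
demo['CFR'] = np.where(demo['Cumulative_cases'] > 0,
                       (demo['Cumulative_deaths'] / demo['Cumulative_cases']).round(3),
                       np.nan)
print(demo['CFR'].tolist())   # 100 / 2000 = 0.05 for the first row
```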
# Bar chart
sns.catplot(x='WHO_region', y='Cumulative_cases',data=df_region, kind='bar',aspect=1.5)
sns.catplot(x='WHO_region', y='Cumulative_deaths',data=df_region, kind='bar',aspect=1.5)
# pie chart
fig = plt.figure(2, figsize=(6,6))
labels=['AFRO', 'AMRO', 'EMRO', 'EURO','SEARO','WPRO']
explode=(0.1,0,0.1,0,0.1,0.1)
_ = plt.pie(df_region['Cumulative_cases'],explode=explode,labeldistance=1.35,pctdistance=1.2, \
labels=labels,autopct='%1.1f%%',
shadow=True,startangle=-40)
_ = plt.title('Cumulative_cases across the 6 WHO regions')
plt.show()
fig = plt.figure(2, figsize=(6,6))
labels=['AFRO', 'AMRO', 'EMRO', 'EURO','SEARO','WPRO']
explode=(0.1,0,0.1,0,0.1,0.1)
_ = plt.pie(df_region['Cumulative_deaths'],explode=explode,labeldistance=1.35,pctdistance=1.2, \
labels=labels,autopct='%1.1f%%',
shadow=True,startangle=-40)
_ = plt.title('Cumulative_deaths across the 6 WHO regions')
plt.show()
import plotly.express as px
# bar chart
fig = px.bar(df_region, x="WHO_region", y="Cumulative_deaths", \
color="WHO_region")
fig.show()
# pie chart
fig = px.pie(df_region, values='Cumulative_deaths', names='WHO_region', \
title='Cumulative_deaths across the 6 WHO regions')
fig.show()
df_country = df_World.groupby(by = ['WHO_region','Country','Country_code'])\
[['New_cases','New_deaths']].sum()\
.rename(columns={'New_cases':'Cumulative_cases','New_deaths':'Cumulative_deaths'})
df_country.head()
# Alternatively
df_country = df_World.groupby(by = 'Country')[['Country_code','Cumulative_cases',\
'Cumulative_deaths','WHO_region']].max()
df_country.head()
df_country.sort_values(by=['Cumulative_cases'], ascending = False).head()
# Alternatively
df_country['Cumulative_cases'].nlargest(5)
df_country['Cumulative_cases'].nsmallest(5)
df_country['Cumulative_deaths'].nlargest(5)
# Calculate Case Fatality Rate (CFR)
df_country['CFR']= round((df_country['Cumulative_deaths']/df_country['Cumulative_cases']),3)
df_country.head()
df_country.sort_values(by = 'CFR', ascending = False).head(5)
# In order to plot the data, reset the index
df_country = df_country.reset_index()
df_country.head()
# scatter plot:
# It seems that there is a positive relationship between 'Cumulative_deaths' &'Cumulative_cases'
fig = px.scatter(df_country, x="Cumulative_deaths", y='Cumulative_cases',\
trendline="ols",hover_name="Country")
fig.show()
# Scatter plot: a larger circle means a higher CFR for that country
# The United States of America, at the top right, is an outlier
fig = px.scatter(df_country, x="Cumulative_deaths", y='Cumulative_cases',\
size="CFR", color="WHO_region", hover_name="Country", size_max=30, \
template = 'plotly_dark')
fig.show()
# Treemap: larger rectangles mean more Cumulative_cases
# Colors closer to blue mean a higher CFR (Case Fatality Rate)
# We can see that France has a high CFR
fig = px.treemap(df_country, path=['WHO_region', 'Country'], values='Cumulative_cases', \
color='CFR',hover_data=['Country'], color_continuous_scale='RdBu')
fig.show()
# Maps
# Something weird: only Taiwan shows up, and no other countries
# I assume it is a problem with Country_code (Taiwan is the only one whose code is 3 letters)
fig = px.scatter_geo(df_country, locations="Country_code", color='CFR', hover_name="Country",\
projection="natural earth")
fig.show()
df_country.replace('TWN', 'TW', inplace = True)
df_country[df_country.Country == 'Taiwan']
I need to convert Country_code from the current 2-letter format to the 3-letter Alpha-3 format, which Plotly's map functions expect.
# Load the data
iso_alpha = pd.read_csv('wikipedia_isocode.csv')
iso_alpha.head()
# drop unnecessary columns
iso_alpha.drop(columns = ['English short name lower case',\
'Numeric code','ISO 3166-2'], inplace = True)
iso_alpha.shape
# Merge df_country & iso_alpha
df = df_country.merge(iso_alpha, left_on='Country_code',right_on='Alpha-2 code', how='left')
df.head()
# Find the records that were not matched: there are 6 in total
df['Alpha-3 code'].isna().sum()
iso_missing = df[df['Country_code'] != df['Alpha-2 code']]
iso_missing
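Comparing the two key columns works, but `merge` can flag unmatched rows itself: passing `indicator=True` adds a `_merge` column marked 'both' or 'left_only'. A sketch on toy frames (the rows are made up for illustration):

```python
import pandas as pd

demo_left = pd.DataFrame({'Country_code': ['TW', 'SS'],
                          'Country': ['Taiwan', 'South Sudan']})
demo_right = pd.DataFrame({'Alpha-2 code': ['TW'], 'Alpha-3 code': ['TWN']})

# indicator=True adds a _merge column: 'both' for matches, 'left_only' for misses
merged = demo_left.merge(demo_right, left_on='Country_code',
                         right_on='Alpha-2 code', how='left', indicator=True)
unmatched = merged[merged['_merge'] == 'left_only']
print(unmatched['Country'].tolist())   # ['South Sudan']
```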
Source: https://countrycode.org/
iso_missing['Alpha-3 code'] = ['SSD','XKX','SXM','CUW','BES','NAM']
iso_missing.head()
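Assigning the six Alpha-3 codes as a positional list silently depends on the row order of `iso_missing`; mapping by code is safer. A sketch, assuming these are the six unmatched 2-letter codes (reconstructed for illustration):

```python
import pandas as pd

# Hypothetical reconstruction of the six unmatched 2-letter codes
missing_demo = pd.DataFrame({'Country_code': ['SS', 'XK', 'SX', 'CW', 'BQ', 'NA']})

# Explicit per-code mapping: immune to row-order changes, unlike a positional list
alpha3_map = {'SS': 'SSD', 'XK': 'XKX', 'SX': 'SXM',
              'CW': 'CUW', 'BQ': 'BES', 'NA': 'NAM'}
missing_demo['Alpha-3 code'] = missing_demo['Country_code'].map(alpha3_map)
print(missing_demo['Alpha-3 code'].tolist())
```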
# Extract the countries whose Alpha-3 codes were matched
df_noNA = df[df['Country_code'] == df['Alpha-2 code']]
# Concatenate df_noNA & iso_missing
df = pd.concat([df_noNA, iso_missing], axis = 0)
df.tail()
df['Alpha-3 code'].isna().sum()
# Maps: try again, and this time the plot is correct
# Colors closer to yellow mean a higher CFR (Case Fatality Rate)
# We can see that Yemen stands out
fig = px.choropleth(df, locations="Alpha-3 code", color='CFR', hover_name="Country", \
projection="natural earth")
fig.show()
Explore the numbers of new cases and new deaths to see whether the epidemic has stabilized or slowed down.
df_World.head()
df_date_case = df_World.groupby('Date_reported')['New_cases'].sum().reset_index()
df_date_case.head()
df_date_case['binned_Dates'] = pd.cut(df_date_case.Date_reported, 5)
df_date_case.head()
sns.catplot(y='binned_Dates', x='New_cases', data=df_date_case, kind='bar',aspect=3)
df_date_death = df_World.groupby('Date_reported')['New_deaths'].sum().reset_index()
df_date_death.head()
df_date_death['binned_Dates'] = pd.cut(df_date_death.Date_reported, 5)
df_date_death.head()
sns.catplot(y='binned_Dates', x='New_deaths', data=df_date_death, kind='bar',aspect=3)
df_date = df_World.groupby('Date_reported').sum()[['New_cases','New_deaths']]
df_date.head()
df_date.plot(title = 'New cases & new deaths worldwide over time')
# First, I need to make the data type of Date_reported a string
# Otherwise, Plotly Express will raise an error later
df_date = df_date.reset_index()
df_date['Date_reported'] = df_date['Date_reported'].astype('str').apply(lambda x : x[:10])
df_date.head()
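Slicing the first ten characters of the string representation works; `dt.strftime` states the target format explicitly and is arguably clearer. A small sketch:

```python
import pandas as pd

dates = pd.Series(pd.to_datetime(['2020-01-11', '2020-06-20']).tz_localize('UTC'))

# strftime makes the target format explicit instead of relying on slicing
as_str = dates.dt.strftime('%Y-%m-%d')
print(as_str.tolist())   # ['2020-01-11', '2020-06-20']
```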
fig = px.line(df_date, x='Date_reported', y='New_cases', \
labels={'y':'New_cases'},
title='New cases worldwide over time' )
fig.show()
fig = px.line(df_date, x='Date_reported', y='New_deaths', \
labels={'y':'New_deaths'},
title='New deaths worldwide over time')
fig.show()
country = input("Please input the country name:")
df_country = df_World[df_World.Country == country].set_index('Date_reported')\
[['New_cases','New_deaths']]
df_country.plot( title = 'New cases & New deaths over time for ' + '%s' %country)
country = input("Please input the country name:")
fig = px.line(df_World[df_World.Country == country], x='Date_reported', y='New_cases', \
labels={'y':'New_cases'},
title='New cases over time for ' + '%s' %country)
fig.show()
fig = px.line(df_World[df_World.Country == country], x='Date_reported', y='New_deaths', \
labels={'y':'New_deaths'},
title='New deaths over time for ' + '%s' %country)
fig.show()
df_China = df_World[df_World['Country'] == 'China'].set_index('Date_reported')['Cumulative_cases']
df_China_Jan = df_China[df_China.index < '2020-02-01']
df_China_Jan.head()
df_nonChina = df_World[df_World['Country'] != 'China']
df_nonChina.head()
df_nonChina = df_nonChina.groupby('Date_reported')['Cumulative_cases'].sum()
df_nonChina.head()
df_nonChina_Jan = df_nonChina[df_nonChina.index < '2020-02-01']
plt.plot(df_China_Jan ,'r')
plt.plot(df_nonChina_Jan ,'b')
plt.legend(['Cumulative_cases_China','Cumulative_cases_NonChina']) # Add a legend
plt.show()
df.head() # Each row of this DataFrame is country-based, Date_reported column is not included
iso = df[['Country_code','Alpha-3 code']]
iso.head()
iso.isna().sum()
iso.shape
# change Country_code of df_World from TWN to TW, otherwise it will not be matched
df_World.replace('TWN', 'TW', inplace = True)
df_World.head()
# Merge df_World & iso dataframe
df_plot = df_World.merge(iso, on = 'Country_code', how='left')
df_plot.head()
df_plot.isna().sum()
# In Plotly Express, Date_reported should be a string; otherwise it will raise an error
df_plot['Date_reported'] = df_plot.Date_reported.astype('str').apply(lambda x : x[:10])
# Reorder the dataframe based on Date_reported
df_plot = df_plot.sort_values('Date_reported')
df_plot.head()
Explore the diffusion path of confirmed cases around the world.
Built-in continuous color scales in Plotly: https://plotly.com/python/builtin-colorscales/
# The closer the color is to dark blue, the more serious the situation compared to other countries
# China gets a lighter color over time, indicating that it is becoming less serious relative to others
fig = px.choropleth(df_plot, locations="Alpha-3 code", color='Cumulative_cases'\
,hover_name="Country",animation_frame="Date_reported"\
,color_continuous_scale='Teal',template = 'plotly_white')
fig.show()
fig = px.choropleth(df_plot, locations="Alpha-3 code", color='Cumulative_deaths'\
,hover_name="Country",animation_frame="Date_reported"\
,color_continuous_scale='Reds',template = 'plotly_white')
fig.show()
Migrate the information stored in the df_World DataFrame from Python to SQL.
df_World.head()
Importing the pymysql package establishes the connection between Python and MySQL, building a bridge, so to speak, between the two pieces of software. I used it to connect MySQL Workbench and Jupyter.
Note: All cells below should be executed only once
import pymysql
To pass any data from Python to SQL, I need to designate the database I want to connect to. Before that, I created a schema/database called "dbWHO" in MySQL Workbench, and then created a table called "COVID19" in the dbWHO database.
password = 'kitty33697'
conn = pymysql.connect(database = 'dbWHO', user = 'root', password = password)
cursor = conn.cursor()
Executing SQL queries from Python: Python variables can be used as arguments of the .execute() method as well. The query itself has to be valid SQL, passed to .execute() as a string.
# After running this cell, I obtain 0, which means there is currently no data stored in the table
cursor.execute('SELECT * FROM COVID19;')
# Make sure any data stored in the current DataFrame to be consistent with SQL syntax
df_World.Date_reported = df_World.Date_reported.astype('str').apply(lambda x : x[:10])
df_World.Date_reported = df_World.Date_reported.str.replace('-','/')
df_World.head(3)
I create the query, store it as a string variable in Python, and then pass it as an argument to the execute method; it is the execute method that moves the data to the designated table in Workbench.
The number of rows that can be inserted into a MySQL table with a single query may be limited. Here a limit of twenty thousand rows applies, so I cannot complete the insert all at once. I split the query into two parts, neither of which contains more than 20,000 records.
# Build the INSERT statement for the first batch of rows
insert_query = 'INSERT INTO COVID19 VALUES '
for i in range(10000):  # the first 10000 rows
    insert_query += '('
    # Inner loop: append every column value of row i as a quoted string
    for j in range(df_World.shape[1]):
        insert_query += "'" + str(df_World.iloc[i, j]) + "', "
    # Close this row's tuple: swap the trailing ", " for "), "
    insert_query = insert_query[:-2] + '), '
# Terminate the statement: swap the final trailing ", " for ";"
insert_query = insert_query[:-2] + ';'
cursor.execute(insert_query)
conn.commit()
# This cell must be executed before the next one: the apostrophe in
# "Lao People's Democratic Republic" would otherwise break the single-quote
# wrapping in the hand-built INSERT statement
df_World['Country'].replace(to_replace="Lao People's Democratic Republic",
                            value='Laos', inplace=True)
# Build the INSERT statement for the remaining rows
insert_query = 'INSERT INTO COVID19 VALUES '
for i in range(10000, df_World.shape[0]):  # the remaining rows
    insert_query += '('
    # Inner loop: append every column value of row i as a quoted string
    for j in range(df_World.shape[1]):
        insert_query += "'" + str(df_World.iloc[i, j]) + "', "
    # Close this row's tuple: swap the trailing ", " for "), "
    insert_query = insert_query[:-2] + '), '
# Terminate the statement: swap the final trailing ", " for ";"
insert_query = insert_query[:-2] + ';'
cursor.execute(insert_query)
conn.commit()
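Building the INSERT statements by string concatenation is what forced the Laos rename above. Database drivers can escape values themselves via parameterized `executemany`; pymysql supports this pattern with `%s` placeholders. A sketch of the idea using the stdlib sqlite3 driver (which uses `?` placeholders) so it runs without a server; the table shape and counts are made up for illustration:

```python
import sqlite3
import pandas as pd

# Toy frame with illustrative counts; the apostrophe would break hand-built quoting
demo_df = pd.DataFrame({'Country': ["Lao People's Democratic Republic", 'Taiwan'],
                        'Cumulative_cases': [141, 446]})

demo_conn = sqlite3.connect(':memory:')
demo_cur = demo_conn.cursor()
demo_cur.execute('CREATE TABLE COVID19 (Country TEXT, Cumulative_cases INTEGER)')

# executemany with placeholders: the driver escapes values and batches all rows
rows = list(demo_df.itertuples(index=False, name=None))
demo_cur.executemany('INSERT INTO COVID19 VALUES (?, ?)', rows)
demo_conn.commit()

demo_cur.execute('SELECT COUNT(*) FROM COVID19')
n_rows = demo_cur.fetchone()[0]
print(n_rows)   # 2
```

With this approach, no row-count splitting or country renaming is needed, since the driver handles quoting and batching.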
# Now I have added 23035 records to the COVID19 table
cursor.execute('SELECT * FROM COVID19;')
# Export to csv
df_World.to_csv('COVID19.csv')